Significance analysis and statistical mechanics: an application to clustering 
This paper addresses the statistical significance of structures in random data: Given a set of 
vectors and a measure of mutual similarity, how likely does a subset of these vectors form a cluster 
with enhanced similarity among its elements? The computation of this cluster p-value for ran- 
domly distributed vectors is mapped onto a well-defined problem of statistical mechanics. We solve 
this problem analytically, establishing a connection between the physics of quenched disorder and 
multiple testing statistics in clustering and related problems. In an application to gene expression 
data, we find a remarkable link between the statistical significance of a cluster and the functional 
relationships between its genes. 
Clustering is a heavily used method to group the elements
of a large dataset by mutual similarity. It is 
usually applied without information on the mechanism 
producing similar data vectors. Any clustering depends 
on two ingredients: a notion of similarity between ele- 
ments of the dataset, which leads to a scoring function 
for clusters, and an algorithmic procedure to group ele- 
ments into clusters. Diverse methods address both as- 
pects of clustering: similarities can be defined by Eu- 
clidean or by information-theoretic measures pQ, and 
there are many different clustering algorithms ranging 
from classical k-means [5] and hierarchical clustering [3] 
to recent message-passing techniques [I]. 

An important aspect of clustering is its statistical sig- 
nificance, which poses a conceptual problem beyond scor- 
ing and algorithmics. First, we have to distinguish "true" 
clusters from spurious clusters, which occur also in ran- 
dom data. An example is the starry sky: true clusters are 
galaxies with their stars bound to each other by gravity, 
but there are also spurious constellations of stars which 
are in fact unrelated and may be far from one another. 
Second, clustering procedures generally produce differ- 
ent and competing results, since their scoring function 
depends on free parameters. The most important scor- 
ing parameter weighs number versus size of clusters and 
is contained explicitly (e.g., the number k in k-means 
clustering) or implicitly (e.g., the temperature in super- 
paramagnetic [5] and information-based clustering [I]) in 
all clustering procedures. Choosing smaller values of k 
will give fewer, but larger clusters with lower average sim- 
ilarity between elements. Larger values of k will result in 
more, but smaller clusters with higher average similarity. 
None of these choices is a priori better than any other: 
both tight and loose clusters may reflect important struc- 
tural similarities within a dataset. 

FIG. 1: Clustering a set of random vectors. In a set of randomly chosen vectors, subsets of vectors can arise whose elements share a large similarity with each other. Here a cluster is shown with its center of mass pointing upwards and the shading indicating score contributions. Large clusters with high similarity among their elements occur only in exponentially rare configurations of the random vectors.

Addressing the cluster significance problem requires a 
statistical theory of clustering, which is the topic of this 
paper. Our aim is not to propose a new method for clustering,
but to tell significant clusters from insignificant 
ones. The key result of the paper is the analytic com- 
putation of the so-called cluster p-value p(S), defined as 
the probability that a random data set contains a cluster 
with similarity score larger than S. This result provides a 
conceptual and practical improvement over current meth- 
ods of estimating p-values by simulation of an ensemble 
of random data sets, which are computationally intensive 
and, hence, often omitted in practice. 
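
For comparison, the simulation-based estimate mentioned above can be sketched in a few lines. This is only a generic Monte Carlo baseline, not the analytic method of this paper; the names `empirical_p_value`, `sample_null`, and `score_fn` are hypothetical, and the add-one correction is one common convention assumed here.

```python
import random


def empirical_p_value(observed_score, sample_null, score_fn, n_samples=1000, seed=0):
    """Estimate P(score >= observed_score) under a null model by simulation.

    sample_null(rng) must return one random dataset drawn from the null model,
    score_fn(dataset) must return its (maximal) cluster score.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        if score_fn(sample_null(rng)) >= observed_score:
            hits += 1
    # Add-one correction: the estimate is never exactly zero.
    return (hits + 1) / (n_samples + 1)


def toy_null(rng):
    """Toy null model (illustration only): 20 independent standard Gaussian numbers."""
    return [rng.gauss(0.0, 1.0) for _ in range(20)]


if __name__ == "__main__":
    # Toy score: the largest entry of the dataset.
    print(empirical_p_value(3.0, toy_null, max, n_samples=2000))
```

The smallest p-value such an estimate can resolve is about 1/n_samples, so highly significant clusters require a very large number of simulated datasets, which is what makes this route computationally intensive.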

Our approach is based on an intimate connection be- 
tween cluster statistics and the physics of disordered sys- 
tems. The score S of the highest-scoring cluster in a set of 
random vectors is itself a random variable, whose cumula- 
tive probability distribution defines the p-value p(S). For 
significance analysis, we are specifically interested in the 
large-S* tail of this distribution. Our calculation employs 
the statistical mechanics of a system whose Hamiltonian 
is given by (minus) the similarity score function. In this 
system, log p(S) is the entropy of all data vector configurations
with energy below -S. We evaluate this entropy 
in the thermodynamic limit where both the number of 
random vectors and the dimension of the vector space 
are large. In this limit, the overlap of a data vector with 
a cluster center is a sum of many variables; the result- 
ing thermodynamic potentials can then be expressed in 
terms of averages over Gaussian ensembles. 

High-scoring clusters have to be found in each fixed 
configuration of the random data vectors, which act as 
quenched disorder for the statistics of clusterings. The 
disorder turns out to generate correlations between the 
scores of clusters centered on different directions of the 
data vector space. These correlations, which become par- 
ticularly significant in high-dimensional datasets, show 
that clustering is an intricate multiple-testing problem: 
spurious clusters may appear in many different directions 
of the data vectors. Here, we illustrate our results by 
application to clustering of gene expression data, where 
high-dimensional data vectors are generated by multi- 
ple measurements of a gene under different experimental 
conditions. The link between quenched disorder and mul- 
tiple testing statistics is more generic, as discussed in the 
conclusion. 

Distribution of data vectors and scoring. We consider an ensemble of $N$ vectors $x_1, x_2, \ldots, x_N$, which are drawn independently from a distribution $P_0(x)$. We are specifically interested in data vectors with a large number of components, $M$. Clusters of such vectors are generically supported by multiple vector components, which is the source of the intricate cluster statistics discussed in this paper. We assume that the distribution $P_0(x)$ factorizes in the vector components, $P_0(x) = p_0(x^1) \cdots p_0(x^M)$ (this assumption can be relaxed, see below). Such null models are, of course, always simplifications, but they are useful for significance estimates in empirical data (an example is p-values of sequence alignments [7]). 
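
In practice, a factorized null ensemble is often approximated from an empirical dataset by shuffling each component independently across vectors. The sketch below illustrates this idea; it is an assumption about how one might generate null data, not a procedure described in this paper, and the function name `shuffle_components` is hypothetical. Note that shuffling preserves the empirical marginal distribution of every component separately, and the resulting vectors are exchangeable rather than strictly independent.

```python
import random


def shuffle_components(vectors, seed=0):
    """Return a surrogate dataset in which every component (column) is
    permuted independently across the vectors (rows).

    This destroys correlations between components while keeping the
    empirical distribution of each component unchanged.
    """
    rng = random.Random(seed)
    n = len(vectors)
    m = len(vectors[0])
    columns = []
    for j in range(m):
        column = [vectors[i][j] for i in range(n)]
        rng.shuffle(column)  # in-place permutation of this component
        columns.append(column)
    # Reassemble rows from the shuffled columns.
    return [[columns[j][i] for j in range(m)] for i in range(n)]


if __name__ == "__main__":
    data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
    print(shuffle_components(data, seed=1))
```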

A subset of these vectors forms a cluster. The clustered 
vectors are distinguished by their mutual similarity, or 
equivalently, their similarity to the center z of the clus- 
ter, see Fig. 1. We consider a simple similarity measure 
of vectors, the Euclidean scalar product: each vector x 
contributes a score 
The scoring parameter $\mu$ acts as a threshold; vectors $x$ with an insufficient overlap with the cluster center $z$ result in a negative score contribution. The squared length of cluster centers is normalized to $z \cdot z = M$. 

A cluster can now be defined as a subset of positively 
scoring vectors. The cluster score is the sum of contribu- 
tions from vectors in the cluster, 

$$S(x_1, \ldots, x_N \mid z, \mu) = \sum_{i=1}^{N} \max[s(x_i \mid z, \mu), 0] . \qquad (2)$$

Large values of $\mu$ result in clusters whose elements have a large overlap, small values result in looser clusters. The total score is determined both by the number of elements and by their similarities with the cluster center, that is, tighter clusters with fewer elements can have scores comparable to those of looser but larger clusters. Both the direction $z$ and width parameter $\mu$ of clusters are a priori unknown. 
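
The scoring rule can be illustrated with a short sketch. It assumes that the score contribution of a vector is its overlap with the center minus the threshold, with the overlap normalized as x · z / sqrt(M) so that it is of order one for random vectors; this normalization and the function names are assumptions made for illustration only.

```python
import math


def score_contribution(x, z, mu):
    """Assumed score of one vector: normalized overlap with the center minus threshold."""
    overlap = sum(a * b for a, b in zip(x, z)) / math.sqrt(len(z))
    return overlap - mu


def cluster_score(vectors, z, mu):
    """Sum of positive score contributions, as in Eq. (2).

    Returns the total score and the indices of the vectors in the cluster,
    i.e. those with a positive contribution.
    """
    contributions = [score_contribution(x, z, mu) for x in vectors]
    members = [i for i, s in enumerate(contributions) if s > 0]
    return sum(contributions[i] for i in members), members


if __name__ == "__main__":
    z = [1.0, 1.0, 1.0, 1.0]  # satisfies z . z = M = 4
    data = [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0], [1.0, 0.0, 0.0, 0.0]]
    print(cluster_score(data, z, mu=1.0))
```

Raising `mu` in this example removes weakly aligned vectors from the cluster, which mirrors the trade-off between tight and loose clusters described above.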

Cluster score statistics. To describe the statistics of an arbitrary cluster score $S(x_1, \ldots, x_N)$ for vectors drawn independently from the distribution $P_0(x)$, we consider the partition function

$$Z(\beta) = \int dx_1 P_0(x_1) \cdots dx_N P_0(x_N) \, \exp[\beta S(x_1, \ldots, x_N)] = \int dS \, p(S) \, \exp(\beta S) . \qquad (3)$$

The second step collects all configurations of vectors $(x_1, \ldots, x_N)$ with cluster score $S$, so $p(S)$ denotes the density of states as a function of score $S$. Asymptotically for large $N$, this density can be extracted from $Z(\beta)$ as

$$\log p(S) \simeq N \Omega(s) - \frac{1}{2} \log(g N) . \qquad (4)$$

Here $\Omega(s)$ is the entropy as a function of the score per element, $s = S/N$, which is the Legendre transform of the reduced free energy density $\beta f(\beta) = -\log Z(\beta)/N$, i.e., $\Omega(s) = -\max_\beta [\beta f(\beta) + \beta s] = -\beta^* f(\beta^*) - \beta^* s$. The prefactor $g$ of the subleading term is given by $g = 2\pi \, |(d^2/d\beta^2) \, \beta f(\beta)|_{\beta = \beta^*}$. The p-value of a cluster score $S$ is defined as the probability $\int_S^\infty dS' \, p(S')$ to find a score larger than or equal to $S$. Inserting (4) shows that this p-value equals $p(S)$ up to a proportionality factor of order one. 
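
The relation between the partition function, the entropy, and the p-value can be made explicit by a saddle-point argument. The sketch below assumes that the partition function is the Laplace transform of the density of states, $Z(\beta) = \int dS\, p(S)\, e^{\beta S}$, and that $p(S) \sim e^{N \Omega(s)}$ with a concave $\Omega$; it is a standard large-deviation argument written out for illustration, not a quotation of the calculation.

```latex
\begin{align}
Z(\beta) &= \int dS\, p(S)\, e^{\beta S}
  \simeq N \int ds\, e^{N[\Omega(s) + \beta s]}
  \quad\Rightarrow\quad
  -\beta f(\beta) = \max_s\,[\Omega(s) + \beta s], \\
\Omega(s) &= \min_\beta\,[-\beta f(\beta) - \beta s]
  = -\beta^* f(\beta^*) - \beta^* s,
  \qquad \Omega'(s) = -\beta^*, \\
\int_S^\infty dS'\, p(S') &\simeq p(S) \int_S^\infty dS'\, e^{-\beta^* (S' - S)}
  = \frac{p(S)}{\beta^*} \qquad (\beta^* > 0).
\end{align}
```

The last line illustrates why the p-value and $p(S)$ differ only by a factor of order one in the large-score tail, where $\beta^*$ is positive.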

FIG. 2: Cluster score distributions in random data for fixed and optimal cluster direction. Analytical distributions $p(S)$ (solid lines) are plotted against the score per element, $s = S/N$, and are compared to normalized histograms obtained from numerical experiments with $10^6$ samples (symbols). (a) Distribution $p_c(S)$ of the cluster score (2) for fixed cluster center and datasets of $N = 6000$ vectors with $M = 70$, with parameter $\mu = 0.1\sqrt{M}$. Error bars show the standard error due to the finite size of the sample. (b) Distribution of the maximum cluster score (6) with parameter $\mu = 0.1\sqrt{M}$ for $N = 40$ (triangles), $N = 80$ (circles) and $N = 120$ (squares), keeping $M/N = 0.5$ fixed.

Clusters in a fixed direction. As a first step, and to illustrate the generating function (3), we compute the distribution of scores for clusters with a fixed center $z$. We assume that the null distribution $p_0$ for vector components has finite moments, set the first two moments to 0 and 1 without loss of generality, and we choose $z$ to lie in some direction which has non-zero overlap with a finite fraction of all $M$ directions. Hence, the overlap $x_i = \mathbf{x}_i \cdot z$ is approximately Gaussian-distributed by the central limit theorem. The generating function (3) gives
where the index $c$ denotes evaluation for a fixed cluster center and $H(x) = \int_x^\infty dx' \, G(x')$ is the cumulative distribution function of the Gaussian $G(x) = \exp(-x^2/2)/\sqrt{2\pi}$. The result is an integral over the component $x = \mathbf{x} \cdot z$ of a data vector in the direction of the cluster center: Below the score threshold $\mu$, the component gives zero score, which contributes the cumulative distribution $\int_{-\infty}^{\mu} dx \, G(x)$ to the partition function. Above the score threshold, the component gives a positive score, which generates a contribution of $\int_{\mu}^{\infty} dx \, G(x) \exp\{\beta s(x|\mu)\}$. The resulting score distribution is given by (4), $\log p_c(S) = N \Omega_c(s = S/N) - (1/2) \log(g_c N)$, see Fig. 2(a). 
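
As a numerical illustration of the fixed-center case, the sketch below evaluates the single-vector partition function and its Legendre transform. It assumes that the overlap x is a standard Gaussian variable and that the score contribution above threshold is s(x|μ) = x − μ; the closed form in `log_z_per_vector` follows from these assumptions by completing the square. The function names are hypothetical, and the Legendre transform is carried out by a plain grid search rather than analytically.

```python
import math


def gaussian_tail(t):
    """Probability that a standard Gaussian variable exceeds t."""
    return 0.5 * math.erfc(t / math.sqrt(2.0))


def log_z_per_vector(beta, mu):
    """log of the single-vector partition function, i.e. -beta * f_c(beta, mu).

    Below the threshold the score is zero (weight 1); above it the weight is
    exp(beta * (x - mu)), assuming s(x|mu) = x - mu.
    """
    below = 1.0 - gaussian_tail(mu)
    above = math.exp(0.5 * beta * beta - beta * mu) * gaussian_tail(mu - beta)
    return math.log(below + above)


def mean_score(mu):
    """Typical score per element, E[max(x - mu, 0)] for standard Gaussian x."""
    density = math.exp(-0.5 * mu * mu) / math.sqrt(2.0 * math.pi)
    return density - mu * gaussian_tail(mu)


def entropy(s, mu, beta_max=10.0, steps=20000):
    """Omega_c(s) = min over beta of [log z(beta) - beta * s], by grid search."""
    best = float("inf")
    for k in range(steps + 1):
        beta = -beta_max + 2.0 * beta_max * k / steps
        best = min(best, log_z_per_vector(beta, mu) - beta * s)
    return best


if __name__ == "__main__":
    mu = 0.84
    for s in (mean_score(mu), 0.2, 0.3, 0.5):
        print(round(s, 4), entropy(s, mu))
```

Under these assumptions the entropy vanishes at the typical score per element and becomes negative for larger scores, so that N times the entropy gives the leading behavior of log p_c(S) in the tail.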

Maximal scoring clusters. To gauge the statistical significance of high-scoring clusters in actual datasets we need to know the distribution of the maximum cluster score in random data. The maximum cluster score is in turn implicitly related to the optimal cluster direction in a dataset: for a given subset of vectors $x_1, \ldots, x_k$, the maximal cluster score is reached if the center $z$ coincides with the "center of mass", $x_{av} = (x_1 + \ldots + x_k)/k$. However, adding or removing vectors shifts the center of mass $x_{av}$ of the cluster and changes the score of each vector. Thus, finding the maximum score for a given dataset
is a hard algorithmic problem, in particular for large dimensions $M$. We calculate the distribution of $S_{max}$ for independent random vectors from the generating function (3) with the integral representation
for the statistical weight of a configuration $x_1, \ldots, x_N$. For large values of the auxiliary variable $\beta'$, only directions $z$ with a high cluster score $S(x_1, \ldots, x_N | z, \mu)$ contribute to this integral over cluster directions $z$, and the maximum over the cluster score (6) is reproduced in the limit $\beta' \to \infty$. We obtain

This expression is to be understood in the asymptotic limit $N \to \infty$ with $M/N$ kept fixed. The result (8) involves a variation over a, which, compared to the corresponding expression (5) for fixed cluster center, generates an effective shift a/2 in the score cutoff $\mu$ and an additional entropy-like term. The calculation uses the so-called replica trick [SUllj, representing the power $n = \beta/\beta'$ of the integral in (7) by a product of $n$ copies (replicas). The calculation proceeds for integer values of $n$, and the limit $n \to 0$ ($\beta' \to \infty$) is taken by analytic continuation. A key ingredient is the average overlap $q = (z \cdot z')/M$ between directions of different cluster centers for the same configuration of data vectors at finite temperature $1/\beta'$. We find a unique ground state (i.e., $q \to 1$ for $\beta' \to \infty$) and a low-temperature expansion
of the average overlap, similar to the case of directed polymers in a random potential [TSJ, which arises in the statistics of sequence alignment [13]. Thus, the effect of center optimization on the free energy density (8) and on cluster p-values is related to the fluctuations between subleading cluster centers for the same random dataset.

This solution determines the asymptotic form of the distribution of maximum cluster score $S_{max} = S$ as given by (4), $\log p(S) = N \Omega(s) + O(\log N)$. Fig. 2(b) shows this result together with numerical simulations for several values of $M$ and $N$, producing good agreement already for moderate $N$. According to (8), the effect of center optimization on score statistics increases with $M$ and decreases with $N$. For small $M/N$, we expand the solution in $N$ for fixed large $M$ and obtain $-\beta f(\beta, \mu) = -\beta f_c(\beta, \mu) + (M/2N) \log N + \mathrm{const.}$, which leads to a distribution of maximum cluster scores
up to terms of order $N^0$. We have generalized this calculation to null distributions $P_0$ with arbitrary correlations between vector components $x^1, \ldots, x^M$ [5].
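
The interplay between cluster membership and the center of mass suggests a simple local search, sketched below. This is an illustrative heuristic, not the analytic calculation of this paper and not a method proposed by the authors: it alternates between selecting the positively scoring vectors and moving the center to their rescaled center of mass, and it can get stuck in local maxima. The score normalization x · z / sqrt(M) − μ and all function names are assumptions.

```python
import math
import random


def overlap(x, z):
    """Assumed normalized overlap x . z / sqrt(M)."""
    return sum(a * b for a, b in zip(x, z)) / math.sqrt(len(z))


def cluster_score(vectors, z, mu):
    """Sum of positive score contributions for center z and threshold mu."""
    return sum(max(overlap(x, z) - mu, 0.0) for x in vectors)


def rescale(v):
    """Rescale v so that v . v = M, the normalization of cluster centers."""
    norm = math.sqrt(sum(a * a for a in v))
    factor = math.sqrt(len(v)) / norm
    return [a * factor for a in v]


def local_search(vectors, z_start, mu, max_iter=100):
    """Alternate between membership selection and center-of-mass update.

    For a fixed member set the rescaled center of mass maximizes the summed
    overlap, and re-selecting members cannot lower the score, so the score
    does not decrease from one iteration to the next.
    """
    z = rescale(z_start)
    members = None
    for _ in range(max_iter):
        new_members = [i for i, x in enumerate(vectors) if overlap(x, z) - mu > 0]
        if not new_members or new_members == members:
            break  # empty cluster or converged membership
        members = new_members
        center = [sum(vectors[i][m] for i in members) for m in range(len(z))]
        if all(c == 0.0 for c in center):
            break
        z = rescale(center)
    return z, cluster_score(vectors, z, mu)


if __name__ == "__main__":
    rng = random.Random(1)
    data = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(50)]
    # Restart from every data vector and keep the best local maximum found.
    best = max(local_search(data, x, mu=0.5)[1] for x in data)
    print(best)
```

Restarting from many initial directions gives only a lower bound on the maximum cluster score of a dataset, which is consistent with the remark that the exact maximization is hard for large M.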

The free energy density (8) was derived under the assumption of replica symmetry (RS) [5], implying that only a single direction $z$ yields the maximal score. This is appropriate for high-scoring clusters, since they occur in exponentially rare configurations of the random vectors, for which a second cluster direction with the same score would be even more unlikely. On the other hand, RS is known to be violated in the case $\beta = 0$, which describes clusters in typical configurations of the random vectors. This case has been studied before in the context of unsupervised learning in neural networks [10]. RS is also likely to be broken for $\beta < 0$, which describes configurations with score maxima biased towards values lower than in typical configurations. The limit $\beta \to -\infty$ is 
relevant to the problem of sphere packing in high dimensions, for which currently only loose bounds are known.

FIG. 3: Statistical significance of clusters correlates with functional annotation for yeast expression data. The significance $p_{GO}$ of gene annotation terms vs. the cluster score significance, traced over a range of scoring parameter $\mu$ (shown by color-scale) of three representative clusters involved in translation (ribosomal genes), sulfur metabolic process and carbohydrate metabolic process.

Application to clusters in gene expression data. Clusters with high statistical significance may contain elements with a common mechanism causing their similarity. Here we test the link between our p-value and biological function of clusters in a dataset of gene expression in yeast [14, 15]. We trace several high-scoring clusters over the range of $\mu$ where they give a positive score. As $\mu$ increases, the cluster opening-angle decreases (see Fig. 1), leading to a tighter, smaller cluster. The cluster p-value also changes continuously, and the genes contained in the cluster also change. We ask if specific functional annotations (gene ontology GO-terms) appear repeatedly in the genes of a cluster, and how likely it is for such a functional enrichment to arise by chance. We compute the p-value $p_{GO}(C)$ of the most significantly enriched GO-term in a cluster $C$, using parent-child enrichment analysis [16] with a Bonferroni correction. A cluster with small $p_{GO}(C)$ is thus significantly enriched in at least one GO-annotation, which points to a functional relationship between its genes. As shown in Fig. 3, the parameter dependence of the cluster score significance $p(S(C))$ and the significance $p_{GO}(C)$ of gene annotation terms is strikingly similar. The statistical measure based on cluster score p-values thus is a good predictor of the functional coherence of a cluster's elements.

Conclusions. We have established a link between quenched disorder physics and the multiple testing statistics in clustering. This connection applies to a much broader class of problems, which involve the parallel testing of an exponentially large number of hypotheses on a single dataset. Examples include imaging data (e.g. fMRI) and the analysis of next-generation sequencing data. If the scores of different hypotheses are correlated with each other, the distribution of the maximal score is not described by a known universality class of extreme value statistics. It may still be computable by the methods used here: the state space of the problem is the set of all hypotheses tested (here the centers and widths of all clusters), and configurations of data vectors generated by a null model act as quenched random disorder.

Many thanks to M. Vingron for discussions and B. Nadler, M. Schulz, and E. Szczurek for their comments on the manuscript. This work was supported by Deutsche Forschungsgemeinschaft (DFG) grants GRK1360 (M. Luksza), BE 2478/2-1 (J. Berg) and SFB 680.